Every day we produce and encounter huge amounts of text data, spoken or written, in many languages. However, the only language computers understand is numbers, so to be efficient we need to train computers to understand spoken and written words. This can be achieved through natural language processing (NLP). NLP gives computers the ability to understand written text and spoken words in much the same way human beings can: it enables computers to process human language in the form of text or voice data and to 'understand' its full meaning, complete with the speaker's or writer's intent and sentiment.
We will examine the three files en_US.blogs, en_US.news and en_US.twitter for file size and number of characters, words and lines.
library(tidytext, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(stringi, warn.conflicts = FALSE)
library(plotly, warn.conflicts = FALSE)
library(qdapRegex, warn.conflicts = FALSE)
library(wordcloud, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE)
library(syuzhet, warn.conflicts = FALSE)
library(SentimentAnalysis, warn.conflicts = FALSE)
library(sentimentr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
The sizes in megabytes (MB) for en_US.blogs, en_US.news and en_US.twitter are shown below.
| File | Size (MB) |
|---|---|
| Blogs | 200.4242 |
| News | 196.2775 |
| Twitter | 159.3641 |
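The sizes above can be obtained with base R's `file.size()`; a minimal sketch, assuming the three files sit in the working directory:

```r
# File sizes in megabytes (assumes the en_US files are in the working directory)
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")
sizes_mb <- file.size(files) / 1024^2  # bytes -> MB
round(sizes_mb, 4)
```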
The three data files will be imported and a sample taken from each for further analysis. We will remove profane words by filtering our data against the words in the profanity.txt file.
setwd("C:/Users/justi/Documents/Olu_Drive/Coursera/Data_Science_Statistics_and_Machine_Learning_Specialization/Capstone/en_US")
blogs_txt <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news_txt <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter_txt <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_txt <- readLines("profanity.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_df <- tibble(profanity_txt)
special_txt <- readLines("special.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
special_df <- tibble(special_txt)
Here we determine the basic features of en_US.blogs, en_US.news and en_US.twitter: how many characters (words, spaces and others), words and lines are in each data set? Together with the file sizes, all are presented in the table below.
| File Type | File Size (MB) | Number of Characters | Number of Words | Number of Lines |
|---|---|---|---|---|
| Blogs | 200.42 | 206824505 | 37546250 | 899288 |
| News | 196.28 | 15639408 | 2674536 | 77259 |
| Twitter | 159.36 | 162096241 | 30093413 | 2360148 |
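One way to compute these counts is with `nchar()` for characters, `stri_count_words()` from the `stringi` package loaded above for words, and `length()` for lines; a minimal sketch (the tibble layout is an assumption, not the report's actual code):

```r
# Basic features per file: characters, words and lines
summary_df <- tibble(
  File       = c("Blogs", "News", "Twitter"),
  Characters = c(sum(nchar(blogs_txt)),
                 sum(nchar(news_txt)),
                 sum(nchar(twitter_txt))),
  Words      = c(sum(stri_count_words(blogs_txt)),
                 sum(stri_count_words(news_txt)),
                 sum(stri_count_words(twitter_txt))),
  Lines      = c(length(blogs_txt),
                 length(news_txt),
                 length(twitter_txt))
)
summary_df
```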
I will sample 0.5% of each data set (blogs_txt, news_txt and twitter_txt) to form a single set, sample1_txt. See below the first 3 lines of sample1_txt.
[1] "Or put another way – in the spirit of this site’s mission – it’s all bollocks."
[2] "No Regrets for Our Youth – 0"
[3] "Tom: See you!"
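The sampling step itself can be sketched with base R's `sample()`; the seed value is an assumption added only for reproducibility:

```r
# Draw a 0.5% sample from each corpus and combine into one character vector
set.seed(1234)  # assumed seed, for reproducibility
sample1_txt <- c(sample(blogs_txt,   round(length(blogs_txt)   * 0.005)),
                 sample(news_txt,    round(length(news_txt)    * 0.005)),
                 sample(twitter_txt, round(length(twitter_txt) * 0.005)))
head(sample1_txt, 3)
```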
First, we need to clean the data and remove irrelevant characters so we can concentrate on the important words from this file.
# Drop lines that contain non-ASCII characters
latin1ASCII_func <- grep("latin1ASCII", iconv(sample1_txt, "latin1", "ASCII", sub = "latin1ASCII"))
sample2_txt <- sample1_txt[-latin1ASCII_func]
Next we remove ampersands, retweet markers, digits, hashtags, URLs, punctuation and extra spaces.
sample3_txt <- gsub("&", " ", sample2_txt) # remove ampersands
sample3_txt <- gsub("RT :|@[a-zA-Z]*: ", " ", sample3_txt) # remove retweet markers
sample3_txt <- gsub("@\\w+", " ", sample3_txt) # remove @mentions
sample3_txt <- gsub("[[:digit:]]", " ", sample3_txt) # remove digits
sample3_txt <- gsub(" #\\S*"," ", sample3_txt) # remove hash tags
sample3_txt <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample3_txt) # remove url
sample3_txt <- gsub("[^[:alnum:][:space:]']", "", sample3_txt) # Remove punctuation except apostrophes
sample3_txt <- rm_white(sample3_txt) # remove extra spaces using `qdapRegex` package
See below the first 3 lines of the cleaned set.
[1] "Tom See you"
[2] "See it's all the fault of evolution"
[3] "But seriously Wells Youngs WHAT IS THIS BULL CRAP ABOUT NOT SELLING IT IN THE UK UNTIL NEXT YEAR Get it sorted I want to be drinking this at Christmas"
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
We need to both break the text into individual tokens and transform it to a tidy data structure. This is equivalent to a unigram (1-gram). We also need to filter profane words out of the text corpus.
sample_df <- tibble(text = sample3_txt) # put the cleaned sample into a one-column tibble
unigram <- sample_df %>%
unnest_tokens(word, text) %>%
filter(!word %in% profanity_df$profanity_txt) %>% # remove profane words
filter(!word %in% special_df$special_txt) %>% # remove special words
drop_na()
The count() function will be useful here; it will help us visualize the dataset. See below the five most frequent words and their frequencies \(n\).
unigram <- unigram %>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
filter(n > 10)
head(unigram, 5)
# A tibble: 5 x 2
word n
<fct> <int>
1 the 10521
2 to 7141
3 i 5938
4 a 5843
5 and 5626
We use ggplot2 to generate the histogram and line graph below, showing words that occur more than 800 times.
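A minimal sketch of such a plot (the geom choice, fill colour and axis labels are assumptions; the report's actual plotting code is not shown):

```r
# Bar chart of words occurring more than 800 times in the sample
unigram %>%
  filter(n > 800) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +  # horizontal bars so the words stay readable
  labs(x = NULL, y = "Frequency")
```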